We present results of the implementation of one MILC lattice QCD application (simulation with dynamical clover fermions using the hybrid molecular dynamics R algorithm) on the Cell Broadband Engine processor. Fifty-four individual computational kernels responsible for 98.8% of the overall execution time were ported to the Cell's Synergistic Processing Elements (SPEs). The remaining application framework, including MPI-based distributed code execution, was left to the Cell's PowerPC processor. We observe that the kernels only infrequently achieve more than 10 GFLOPS, which is just over 4% of the Cell's peak performance. At the same time, many of the kernels sustain a bandwidth close to 20 GB/s, which is 78% of the Cell's peak. This indicates that the application performance is limited by the bandwidth between the main memory and the SPEs. In spite of this limitation, speedups of 8.7x (for an 8x8x16x16 lattice) and 9.6x (for a 16x16x16x16 lattice) were achieved when comparing a 3.2 GHz Cell processor to a single core of a 2.33 GHz Intel Xeon processor. When comparing the code scaled up to execute on a dual-Cell blade and on a quad-core dual-chip Intel Xeon blade, the speedups are 1.5x (8x8x16x16 lattice) and 4.1x (16x16x16x16 lattice).
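
The bandwidth-bound argument above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the abstract. It assumes, purely for illustration, that a single kernel both reaches 10 GFLOPS and sustains 20 GB/s (the abstract does not say these are the same kernel), and it reads "just over 4%" as exactly 4%. All variable names are hypothetical.

```python
# Figures quoted in the abstract.
kernel_gflops = 10.0     # sustained compute rate of the fastest kernels
gflops_fraction = 0.04   # "just over 4%" of peak, taken here as exactly 4% (assumption)
kernel_bw_gbs = 20.0     # sustained main-memory bandwidth
bw_fraction = 0.78       # 78% of peak bandwidth

# Peak values implied by the quoted percentages (not stated in the abstract).
implied_peak_gflops = kernel_gflops / gflops_fraction   # roughly 250 GFLOPS
implied_peak_bw_gbs = kernel_bw_gbs / bw_fraction       # roughly 25.6 GB/s

# Arithmetic intensity of the illustrative kernel, in flops per byte moved.
flops_per_byte = kernel_gflops / kernel_bw_gbs

# Best compute rate this kernel could reach if it used 100% of memory bandwidth.
bandwidth_ceiling_gflops = flops_per_byte * implied_peak_bw_gbs
ceiling_fraction_of_peak = bandwidth_ceiling_gflops / implied_peak_gflops

print(f"arithmetic intensity: {flops_per_byte:.2f} flop/byte")
print(f"bandwidth-limited ceiling: {bandwidth_ceiling_gflops:.1f} GFLOPS")
print(f"ceiling as fraction of peak compute: {ceiling_fraction_of_peak:.1%}")
```

Under these assumptions, a kernel performing about half a floating-point operation per byte moved would stay near 5% of peak compute even with the memory bus fully saturated, which is consistent with the conclusion that memory bandwidth, not SPE arithmetic throughput, is the limiting resource.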